Abstract

Contextual bandit learning is a reinforcement 
learning problem where the learner repeatedly re- 
ceives a set of features (context), takes an action 
and receives a reward based on the action and 
context. We consider this problem under a re- 
alizability assumption: there exists a function in 
a (known) function class, always capable of pre- 
dicting the expected reward, given the action and 
context. Under this assumption, we show three 
things. We present a new algorithm, Regressor
Elimination, with regret similar to that of the agnostic
setting (i.e., in the absence of the realizability
assumption). We prove a new lower bound
showing no algorithm can achieve superior per- 
formance in the worst case even with the realiz- 
ability assumption. However, we do show that 
for any set of policies (mapping contexts to ac- 
tions), there is a distribution over rewards (given 
context) such that our new algorithm has con- 
stant regret unlike the previous approaches. 



1 Introduction 

We are interested in the online contextual bandit setting, 
where on each round we first see a context $x \in X$, based
on which we choose an action $a \in A$, and then observe
a reward r. This formalizes several natural scenarios. For 
example, a common task at major internet engines is to dis- 
play the best ad from a pool of options given some con- 
text such as information about the user, the page visited, 
the search query issued etc. The action set consists of the 
candidate ads and the reward is typically binary based on 
whether the user clicked the displayed ad or not. Another 
natural application is the design of clinical trials in the medical
domain. In this case, the actions are the treatment options
being compared, the context is the patient's medical
record and reward is based on whether the recommended 
treatment is a success or not. 

Our goal in this setting is to compete with a particular set 
of policies, which are deterministic rules specifying which 
action to choose in each context. We note that this setting
includes as special cases the classical K-armed bandit
problem (Lai and Robbins, 1985) and associative reinforcement
learning with linear reward functions (Auer,
2003; Chu et al., 2011).

The performance of algorithms in this setting is typically 
measured by the regret, which is the difference between the 
cumulative reward of the best policy and the algorithm. For 
the setting with an arbitrary set of policies, the achieved regret
guarantee is $O(\sqrt{KT \ln(N/\delta)})$, where K is the number
of actions, T is the number of rounds, N is the number
of policies and $\delta$ is the probability of failing to achieve
the regret (Beygelzimer et al., 2010; Dudík et al., 2011).
While this bound has a desirably small dependence on the
parameters T, N, the scaling with respect to K is often too
big to be meaningful. For instance, the number of ads under
consideration can be huge, and a rapid scaling with the
number of alternatives in a clinical trial is clearly undesirable.
Unfortunately, the dependence on K is unavoidable
as proved by existing lower bounds (Auer et al., 2003).



A large literature on "linear bandits" manages to avoid this
dependence on K by making additional assumptions. For
example, Auer (2003) and Chu et al. (2011) consider the
setting where the context x consists of feature vectors $x_a \in \mathbb{R}^d$
describing each action, and the expected reward function
(given a context x and action a) has the form $w^\top x_a$
for some fixed vector $w \in \mathbb{R}^d$. Dani et al. (2008) consider
a continuous action space with $a \in \mathbb{R}^d$, without contexts,
with a linear expected reward $w^\top a$, which is generalized
by Filippi et al. (2010) to $\sigma(w^\top a)$ with a known Lipschitz-continuous
link function $\sigma$. A striking aspect of the linear
and generalized linear setting is that while the regret
grows rapidly with the dimension d, it grows either only
gently with the number of actions K (poly-logarithmic
for Auer, 2003), or is independent of K (Dani et al., 2008;
Filippi et al., 2010). In this paper, we investigate whether
a weaker dependence on the number of actions is possible 
in more general settings. Specifically, we omit the linearity 
assumption while keeping the "realizability" — i.e., we still 
assume that the expected reward can be perfectly modeled, 
but do not require this to be a linear or a generalized linear 
model. 

We consider an arbitrary class F of functions $f : (X, A) \to [0, 1]$
that map a context and an action to a real number. We
interpret $f(x, a)$ as a predicted expected reward of the action
a on context x and refer to functions in F as regressors.
For example, in display advertising, the context is
a vector of features derived from the text and metadata of 
the webpage and information about the user. The action 
corresponds to the ad, also described by a set of features. 
Additional features might be used to model interaction be- 
tween the ad and the context. A typical regressor for this 
problem is a generalized linear model with a logistic link, 
modeling the probability of a click. 

The set of regressors F induces a natural set of policies
$\Pi_F$ containing maps $\pi_f : X \to A$ defined as
$\pi_f(x) = \arg\max_a f(x, a)$. We make the assumption that the
expected reward for a context x and action a equals $f^*(x, a)$
for some unknown function $f^* \in F$. The question we address
in this paper is: Does this realizability assumption
allow us to learn faster? 

We show that for an arbitrary function class, the answer 
to the above question is "no". The $\sqrt{K}$ dependence in regret
is in general unavoidable even with the realizability
assumption. Thus, the structure of linearity or controlled 
non-linearity was quite important in the past works. 

Given this answer, a natural question is whether it is at least 
possible to do better in various special cases. To answer 
this, we create a new natural algorithm, Regressor Elimination
(RE), which takes advantage of realizability. Structurally,
the algorithm is similar to Policy Elimination (PE)
of Dudík et al. (2011), designed for the agnostic case (i.e.,
the general case without the realizability assumption). While
PE proceeds by eliminating poorly performing policies,
RE proceeds by eliminating poorly predicting regressors.
However, the realizability assumption allows a much more
aggressive elimination strategy, different from the strategy
used in PE. The analysis of this elimination strategy is the 
key technical contribution of this paper. 

The general regret guarantee for Regressor Elimination is
$O(\sqrt{KT \ln(NT/\delta)})$, similar to the agnostic case. However,
we also show that for all sets of policies $\Pi$ there exists
a set of regressors F such that $\Pi = \Pi_F$ and the regret of
Regressor Elimination is $O(\ln(N/\delta))$, i.e., independent of
the number of rounds and actions. At first sight, this
seems to contradict our worst-case lower bound. This apparent
paradox is due to the fact that the same set of policies
can be generated by two very different sets of regressors. 
Some regressor sets allow better discrimination of the true 
reward function, whereas some regressor sets will lead to 
the worst-case guarantee. 

The remainder of the paper is organized as follows. In
the next section we formalize our setting and assumptions.
Section 3 provides our algorithm, which is analyzed in Section 4.
In Section 5 we present the worst-case lower bound,
and in Section 6 we show an improved dependence on
K in favorable cases. Our algorithm assumes the exact
knowledge of the distribution over contexts (but not over
rewards). In Section 7 we sketch how this assumption can
be removed. Another major assumption is the finiteness of
the set of regressors F. This assumption is more difficult
to remove, as we discuss in Section 8.

2 Problem Setup 

We assume that the interaction between the learner and nature
happens over T rounds. At each round t, nature picks
a context $x_t \in X$ and a reward function $r_t : A \to [0, 1]$
sampled i.i.d. in each round, according to a fixed distribution
$D(x, r)$. We assume that $D(x)$ is known (this assumption
is removed in Section 7), but $D(r|x)$ is unknown. The
learner observes $x_t$, picks an action $a_t \in A$, and observes
the reward for the action $r_t(a_t)$. We are given a function
class $F : X \times A \to [0, 1]$ with $|F| = N$, where $|F|$ is
the cardinality of F. We assume that F contains a perfect
predictor of the expected reward:

Assumption 1 (Realizability). There exists a function $f^* \in F$
such that $E_{r|x}[r(a)] = f^*(x, a)$ for all $x \in X$, $a \in A$.

We recall as before that the regressor class F induces the
policy class $\Pi_F$ containing maps $\pi_f : X \to A$ defined by
$f \in F$ as $\pi_f(x) = \arg\max_a f(x, a)$. The performance of
an algorithm is measured by its expected regret relative to
the best fixed policy:

$$\mathrm{regret}_T = \sup_{\pi_f \in \Pi_F} \sum_{t=1}^{T} \left[ f^*(x_t, \pi_f(x_t)) - f^*(x_t, a_t) \right].$$

By definition of $\pi_{f^*}$, this is equivalent to

$$\mathrm{regret}_T = \sum_{t=1}^{T} \left[ f^*(x_t, \pi_{f^*}(x_t)) - f^*(x_t, a_t) \right].$$
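The two definitions above, the induced policy and the regret under realizability, can be sketched in a few lines of Python. This is only an illustration: the toy regressor `f_star`, the action list `ACTIONS`, and the function names are hypothetical and not part of the paper.

```python
def induced_policy(f, actions):
    """Return pi_f, which maps a context x to argmax_a f(x, a)."""
    return lambda x: max(actions, key=lambda a: f(x, a))


def regret(f_star, contexts, chosen_actions, actions):
    """Sum over rounds of f*(x_t, pi_{f*}(x_t)) - f*(x_t, a_t)."""
    pi_star = induced_policy(f_star, actions)
    return sum(
        f_star(x, pi_star(x)) - f_star(x, a)
        for x, a in zip(contexts, chosen_actions)
    )


# Hypothetical toy instance: three actions, the best action is x mod 3.
ACTIONS = [0, 1, 2]


def f_star(x, a):
    return 0.9 if a == x % 3 else 0.1
```

Playing the action `x % 3` in every round gives zero regret in this toy instance; every other choice costs 0.8 in expectation for that round.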



3 Algorithm 

Our algorithm, Regressor Elimination, maintains a set of 
regressors that accurately predict the observed rewards. In 
each round, it chooses an action that sufficiently explores 
among the actions represented in the current set of regressors
(Steps 1-2). After observing the reward (Step 3), the
inaccurate regressors are eliminated (Step 4).

Sufficient exploration is achieved by solving the convex optimization
problem in Step 1. We construct a distribution
$P_t$ over current regressors, and then act by first sampling
a regressor $f \sim P_t$ and then choosing an action according
to $\pi_f$. Similarly to the Policy Elimination algorithm
of Dudík et al. (2011), we seek a distribution $P_t$ such that
the inverse probability of choosing an action that agrees
with any policy in the current set is in expectation bounded
from above. Informally, this guarantees that actions of any
of the current policies are chosen with sufficient probabilities.
Using this construction we relate the accuracy of regressors
to the regret of the algorithm (Lemma 4.3).



A priori, it is not clear whether the constraint (3.1) is even
feasible. We prove feasibility by a similar argument as
in Dudík et al. (2011) (see Lemma A.1 in Appendix A).
Compared with Dudík et al. (2011) we are able to obtain
tighter constraints by doing a more careful analysis.

Our elimination step (Step 4) is significantly tighter than a
similar step in Dudík et al. (2011): we eliminate regressors
according to a very strict $O(1/t)$ bound on the suboptimality
of the least squares error. Under the realizability assumption,
this stringent constraint will not discard the optimal
regressor accidentally, as we show in the next section.
This is the key novel technical contribution of this work.

Replacing $D(x)$ in the Regressor Elimination algorithm
with the empirical distribution over observed contexts is
straightforward, as was done in Dudík et al. (2011), and is
discussed further in Section 7.

Algorithm 1 Regressor Elimination

Input:

a set of reward predictors $F = \{f : (X, A) \to [0, 1]\}$,
distribution D over contexts, confidence parameter $\delta$.

Notation:

$\pi_f(x) := \arg\max_{a'} f(x, a')$.

$R_t(f) := \frac{1}{t} \sum_{t'=1}^{t} \left(f(x_{t'}, a_{t'}) - r_{t'}(a_{t'})\right)^2$.

For $F' \subseteq F$, define
$A(F', x) := \{a \in A : \pi_f(x) = a \text{ for some } f \in F'\}$.

$\mu := \min\{1/(2K), 1/\sqrt{T}\}$.

For a distribution P on $F' \subseteq F$, define the conditional
distribution $P'(\cdot|x)$ on A as:
w.p. $(1 - \mu)$, sample $f \sim P$ and return $\pi_f(x)$, and
w.p. $\mu$, return a uniform random $a \in A(F', x)$.

$\delta_t = \delta / (2Nt^3 \log_2(t))$, for t = 1, 2, ..., T.

Algorithm:

$F_0 \leftarrow F$

For t = 1, 2, ..., T:

1. Find distribution $P_t$ on $F_{t-1}$ such that

$$\forall f \in F_{t-1} : \quad E_x\left[\frac{1}{P'_t(\pi_f(x)|x)}\right] \le E_x\left[|A(F_{t-1}, x)|\right] \qquad (3.1)$$

2. Observe $x_t$ and sample action $a_t$ from $P'_t(\cdot|x_t)$.

3. Observe $r_t(a_t)$.

4. Set

$$F_t = \left\{ f \in F_{t-1} : R_t(f) < \min_{f' \in F_{t-1}} R_t(f') + \frac{18 \ln(1/\delta_t)}{t} \right\}$$
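Two pieces of Algorithm 1 are simple enough to sketch directly: the smoothed action distribution $P'(\cdot|x)$ from the Notation block and the elimination rule of Step 4. The sketch below is a hedged illustration only. It does not solve the convex program of Step 1 (the distribution `P` is taken as given), and the function names and the representation of the history as `(x, a, r)` triples are assumptions made for this example.

```python
import math


def smoothed_action_probs(P, regressors, x, actions, mu):
    """P'(.|x): w.p. 1 - mu play pi_f(x) for f ~ P; w.p. mu play uniformly on A(F', x)."""
    greedy = [max(actions, key=lambda a: f(x, a)) for f in regressors]
    support = sorted(set(greedy))  # A(F', x): actions chosen by some regressor
    probs = {a: mu / len(support) for a in support}
    for p_f, a in zip(P, greedy):
        probs[a] += (1.0 - mu) * p_f
    return probs


def eliminate(regressors, history, delta_t):
    """Step 4: keep f whose average squared error is within 18 ln(1/delta_t)/t of the best."""
    t = len(history)

    def avg_sq_err(f):
        return sum((f(x, a) - r) ** 2 for x, a, r in history) / t

    errs = [avg_sq_err(f) for f in regressors]
    threshold = min(errs) + 18.0 * math.log(1.0 / delta_t) / t
    return [f for f, e in zip(regressors, errs) if e < threshold]
```

Because the slack shrinks like $1/t$, a regressor whose squared error exceeds the best one by a constant is discarded after a number of rounds that is logarithmic in $1/\delta_t$.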



4 Regret Analysis 

Here we prove an upper bound on the regret of Regressor 
Elimination. The proved bound is no better than the one for 
existing agnostic algorithms. This is necessary, as we will 
see in Section 5, where we prove a matching lower bound.

Theorem 4.1. For all sets of regressors F with $|F| = N$
and all distributions $D(x, r)$, with probability $1 - \delta$, the
regret of Regressor Elimination is $O(\sqrt{KT \ln(NT/\delta)})$.

Proof. By Lemma 4.1 (proved below), in round t if we
sample an action by sampling f from $P_t$ and choosing
$\pi_f(x_t)$, then the expected regret is $O(\sqrt{K \ln(NT/\delta)/t})$
with probability at least $1 - \delta/2t^2$. The excess regret
for sampling a uniform random action is at most
$\mu \le 1/\sqrt{T}$ per round. Summing up over all the
T rounds and taking a union bound, the total expected
regret is $O(\sqrt{KT \ln(NT/\delta)})$ with probability
at least $1 - \delta$. Further, the net regret is a martingale;
hence the Azuma-Hoeffding inequality with range
[0, 1] applies. So with probability at least $1 - 2\delta$ we
have a regret of $O(\sqrt{KT \ln(NT/\delta)} + \sqrt{T \ln(1/\delta)}) = O(\sqrt{KT \ln(NT/\delta)})$. □
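The summation step in this proof can be spelled out as follows. The shorthand $L = \ln(NT/\delta)$ and the constant $c$ hidden in the per-round $O(\cdot)$ bound are introduced here only for illustration.

```latex
\sum_{t=1}^{T} c\sqrt{\frac{K L}{t}}
  \;\le\; c\sqrt{K L}\int_{0}^{T}\frac{ds}{\sqrt{s}}
  \;=\; 2c\sqrt{K T L},
\qquad
\sum_{t=1}^{T}\mu \;\le\; T\cdot\frac{1}{\sqrt{T}} \;=\; \sqrt{T},
\qquad
\sum_{t=1}^{T}\frac{\delta}{2t^{2}} \;\le\; \frac{\pi^{2}}{12}\,\delta \;<\; \delta .
```

The first sum bounds the regret of the sampled policies, the second bounds the cost of uniform exploration, and the third is the union bound over the per-round failure probabilities.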

Lemma 4.1. With probability at least $1 - \delta_t N t \log_2(t) \ge 1 - \delta/2t^2$, we have:

1. $f^* \in F_t$.

2. For any $f \in F_t$,

$$E_{x,r}\left[r(\pi_{f^*}(x)) - r(\pi_f(x))\right] \le \sqrt{\frac{200 K \ln(1/\delta_t)}{t}}.$$

Proof. Fix an arbitrary function $f \in F$. For every round t,
define the random variable

$$Y_t = (f(x_t, a_t) - r_t(a_t))^2 - (f^*(x_t, a_t) - r_t(a_t))^2.$$

Here, $x_t$ is drawn from the unknown data distribution D,
$r_t$ is drawn from the reward distribution conditioned on $x_t$,
and $a_t$ is drawn from $P'_t$ (which is defined conditioned on
the choice of $x_t$ and is independent of $r_t$). Note that this
random variable is well-defined for all functions $f \in F$,
not just the ones in $F_t$.

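Let $E_t[\cdot]$ and $\mathrm{Var}_t[\cdot]$ denote the expectation and variance
conditioned on all the randomness up to round t. Using
a form of Freedman's inequality from Bartlett et al. (2008)
(see Lemma B.1) and noting that $Y_t \le 1$, we get that with
probability at least $1 - \delta_t \log_2(t)$, we have

$$\sum_{t'=1}^{t} E_{t'}[Y_{t'}] - \sum_{t'=1}^{t} Y_{t'} \le 4\sqrt{\sum_{t'=1}^{t} \mathrm{Var}_{t'}[Y_{t'}] \ln(1/\delta_t)} + 2\ln(1/\delta_t).$$

From Lemma 4.2 we see that $\mathrm{Var}_{t'}[Y_{t'}] \le 4 E_{t'}[Y_{t'}]$, so

$$\sum_{t'=1}^{t} E_{t'}[Y_{t'}] - \sum_{t'=1}^{t} Y_{t'} \le 8\sqrt{\sum_{t'=1}^{t} E_{t'}[Y_{t'}] \ln(1/\delta_t)} + 2\ln(1/\delta_t).$$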

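For notational convenience, define $X = \sqrt{\sum_{t'=1}^{t} E_{t'}[Y_{t'}]}$,
$Z = \sum_{t'=1}^{t} Y_{t'}$, and $C = \sqrt{\ln(1/\delta_t)}$. The above inequality
is equivalent to:

$$X^2 - Z \le 8CX + 2C^2 \iff (X - 4C)^2 - Z \le 18C^2.$$

This gives $-Z \le 18C^2$. Since $Z = t(R_t(f) - R_t(f^*))$,
we get that

$$R_t(f^*) \le R_t(f) + \frac{18 \ln(1/\delta_t)}{t}.$$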



By a union bound, with probability at least $1 - \delta_t N t \log_2(t)$,
for all $f \in F$ and all rounds $t' \le t$, we have

$$R_{t'}(f^*) \le R_{t'}(f) + \frac{18 \ln(1/\delta_t)}{t'},$$

and so $f^*$ is not eliminated in any elimination step and remains
in $F_t$.

Furthermore, suppose f is also not eliminated and survives
in $F_t$. Then we must have $R_t(f) - R_t(f^*) < 18C^2/t$, or
in other words, $Z < 18C^2$. Thus, $(X - 4C)^2 \le 36C^2$,
which implies that $X^2 \le 100C^2$, and hence:

$$\sum_{t'=1}^{t} E_{t'}[Y_{t'}] \le 100 \ln(1/\delta_t). \qquad (4.1)$$
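For readers who want the intermediate algebra behind (4.1), the chain below is one way to fill it in. It uses only the quantities X, Z and C defined in the proof, together with $X \ge 0$ and $C \ge 0$.

```latex
(X - 4C)^2 \;\le\; 18C^2 + Z \;<\; 18C^2 + 18C^2 \;=\; 36C^2
\;\Longrightarrow\; |X - 4C| \;\le\; 6C
\;\Longrightarrow\; X \;\le\; 10C
\;\Longrightarrow\; X^2 \;\le\; 100C^2 \;=\; 100\ln(1/\delta_t).
```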



By Lemma 4.3 and since $P_t$ is measurable with respect to
the past sigma field up to time $t - 1$, for all $t' \le t$ we have

$$\left(E_{x,r}\left[r(\pi_{f^*}(x)) - r(\pi_f(x))\right]\right)^2 \le 2K\, E_{t'}[Y_{t'}].$$

Summing up over all $t' \le t$, and using (4.1) along with
Jensen's inequality we get that

$$E_{x,r}\left[r(\pi_{f^*}(x)) - r(\pi_f(x))\right] \le \sqrt{\frac{200 K \ln(1/\delta_t)}{t}}.$$

□



Lemma 4.2. Fix a function $f \in F$. Suppose we sample
x, r from the data distribution D, and an action a from an
arbitrary distribution such that r and a are conditionally
independent given x. Define the random variable

$$Y = (f(x, a) - r(a))^2 - (f^*(x, a) - r(a))^2.$$

Then we have

$$E_{x,r,a}[Y] = E_{x,a}\left[(f(x, a) - f^*(x, a))^2\right],$$

$$\mathrm{Var}_{x,r,a}[Y] \le 4\, E_{x,r,a}[Y].$$

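Proof. Using shorthands $f_{xa}$ for $f(x, a)$, $f^*_{xa}$ for $f^*(x, a)$ and $r_a$ for $r(a)$,
we can rearrange the definition of Y as

$$Y = (f_{xa} - f^*_{xa})(f_{xa} + f^*_{xa} - 2r_a). \qquad (4.2)$$

Hence, we have

$$E_{x,r,a}[Y] = E_{x,r,a}\left[(f_{xa} - f^*_{xa})(f_{xa} + f^*_{xa} - 2r_a)\right]$$
$$= E_{x,a}\, E_{r|x}\left[(f_{xa} - f^*_{xa})(f_{xa} + f^*_{xa} - 2r_a)\right]$$
$$= E_{x,a}\left[(f_{xa} - f^*_{xa})(f_{xa} + f^*_{xa} - 2E_{r|x}[r_a])\right]$$
$$= E_{x,a}\left[(f_{xa} - f^*_{xa})^2\right],$$

proving the first part of the lemma. From (4.2), noting that
$f_{xa}$, $f^*_{xa}$, $r_a$ are between 0 and 1, we obtain

$$Y^2 \le (f_{xa} - f^*_{xa})^2 (f_{xa} + f^*_{xa} - 2r_a)^2 \le 4 (f_{xa} - f^*_{xa})^2,$$

yielding the second part of the lemma:

$$\mathrm{Var}_{x,r,a}[Y] \le E_{x,r,a}[Y^2] \le 4\, E_{x,r,a}\left[(f_{xa} - f^*_{xa})^2\right] = 4\, E_{x,r,a}[Y].$$

□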
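Both claims of Lemma 4.2 can be checked exactly for a single fixed pair (x, a) when the reward is Bernoulli. The sketch below assumes $r(a) \sim \mathrm{Bernoulli}(f^*(x, a))$, which is one reward distribution consistent with realizability, and uses exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def moments_of_Y(f, f_star):
    """Exact mean and variance of Y = (f - r)^2 - (f_star - r)^2 for r ~ Bernoulli(f_star)."""
    outcomes = [(Fraction(1), f_star), (Fraction(0), 1 - f_star)]  # (reward, probability)
    mean = sum(p * ((f - r) ** 2 - (f_star - r) ** 2) for r, p in outcomes)
    second = sum(p * ((f - r) ** 2 - (f_star - r) ** 2) ** 2 for r, p in outcomes)
    return mean, second - mean ** 2


def lemma_4_2_holds_on_grid(steps=10):
    """Check E[Y] = (f - f*)^2 and Var[Y] <= 4 E[Y] for all f, f* on a grid in [0, 1]."""
    grid = [Fraction(i, steps) for i in range(steps + 1)]
    for f in grid:
        for f_star in grid:
            mean, var = moments_of_Y(f, f_star)
            if mean != (f - f_star) ** 2 or var > 4 * mean:
                return False
    return True
```

The check is conditional on (x, a); averaging over x and a preserves both relations, as in the proof.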

Next we show how the random variable Y defined in
Lemma 4.2 relates to the regret in a single round:

Lemma 4.3. In the setup of Lemma 4.2, assume further
that the action a is sampled from a conditional distribution
$p(\cdot|x)$ which satisfies the following constraint, for $f' = f$
and $f' = f^*$:

$$E_x\left[\frac{1}{p(\pi_{f'}(x)|x)}\right] \le K. \qquad (4.3)$$

Then we have

$$E_{x,r}\left[r(\pi_{f^*}(x)) - r(\pi_f(x))\right] \le \sqrt{2K\, E_{x,r,a}[Y]}.$$
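This lemma is essentially a refined form of Theorem 6.1
in Beygelzimer and Langford (2009), which analyzes the
regression approach to learning in contextual bandit settings.

Proof. Throughout, we continue using the shorthand $f_{xa}$
for $f(x, a)$. Given a context x, let $\tilde{a} = \pi_f(x)$ and
$a^* = \pi_{f^*}(x)$. Define the random variable

$$\Delta_x = E_{r|x}\left[r(a^*) - r(\tilde{a})\right] = f^*_{xa^*} - f^*_{x\tilde{a}}.$$

Note that $\Delta_x \ge 0$ because $f^*$ prefers $a^*$ over $\tilde{a}$ for context
x. Also we have $f_{x\tilde{a}} \ge f_{xa^*}$ since f prefers $\tilde{a}$ over $a^*$
for context x. Thus,

$$f_{x\tilde{a}} - f^*_{x\tilde{a}} + f^*_{xa^*} - f_{xa^*} \ge \Delta_x. \qquad (4.4)$$

As in the proof of Lemma 4.2,

$$E_{r,a|x}[Y] = E_{a|x}\left[(f_{xa} - f^*_{xa})^2\right]$$
$$\ge p(\tilde{a}|x)(f_{x\tilde{a}} - f^*_{x\tilde{a}})^2 + p(a^*|x)(f^*_{xa^*} - f_{xa^*})^2$$
$$\ge \frac{p(\tilde{a}|x)\, p(a^*|x)}{p(\tilde{a}|x) + p(a^*|x)}\, \Delta_x^2. \qquad (4.5)$$

The last inequality follows by first applying the chain

$$ax^2 + by^2 = \frac{ab(x + y)^2 + (ax - by)^2}{a + b} \ge \frac{ab}{a + b}(x + y)^2$$

(valid for a, b > 0), and then applying inequality (4.4).
For convenience, define

$$Q_x = \frac{p(\tilde{a}|x)\, p(a^*|x)}{p(\tilde{a}|x) + p(a^*|x)}, \quad \text{i.e.,} \quad \frac{1}{Q_x} = \frac{1}{p(\tilde{a}|x)} + \frac{1}{p(a^*|x)}.$$

Now, since p satisfies the constraint (4.3) for $f' = f$ and
$f' = f^*$, we conclude that

$$E_x\left[\frac{1}{Q_x}\right] = E_x\left[\frac{1}{p(\tilde{a}|x)}\right] + E_x\left[\frac{1}{p(a^*|x)}\right] \le 2K. \qquad (4.6)$$

We now have

$$E_{x,r}\left[r(\pi_{f^*}(x)) - r(\pi_f(x))\right] = E_x[\Delta_x] = E_x\left[\frac{1}{\sqrt{Q_x}} \sqrt{Q_x}\, \Delta_x\right]$$
$$\le \sqrt{E_x\left[\frac{1}{Q_x}\right]} \sqrt{E_x\left[Q_x \Delta_x^2\right]} \le \sqrt{2K \cdot E_{x,r,a}[Y]},$$

where the first inequality follows from the Cauchy-Schwarz
inequality and the second from the inequalities
(4.5) and (4.6). □

5 Lower bound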



Here we prove a lower bound showing that the realizability 
assumption is not enough in general to eliminate a dependence
on the number of actions K. The structure of this
proof is similar to an earlier lower bound (Auer et al., 2003),
differing in two ways: it applies to regressors of the sort we
consider, and we work N, the number of regressors, into 
the lower bound. Since for every policy there exists a re- 
gressor with argmax on that regressor realizing the policy, 
this lower bound also applies to policy based algorithms. 

Theorem 5.1. For every N and K such that $\ln N / \ln K \le T$,
and every algorithm A, there exists a function class F of
cardinality at most N and a distribution $D(x, r)$ for which
the realizability assumption holds, but the expected regret
of A is $\Omega(\sqrt{KT \ln N / \ln K})$.

Proof. Instead of directly selecting F and D for which the
expected regret of A is $\Omega(\sqrt{KT \ln N / \ln K})$, we create a
distribution over instances (F, D) and show that the expected
regret of A is $\Omega(\sqrt{KT \ln N / \ln K})$ when the expectation
is taken also over our choice of the instance. This
will immediately yield a statement of the theorem, since 
the algorithm must suffer at least this amount of regret on 
one of the instances. 

The proof proceeds via a reduction to the construction
used in the lower bound of Theorem 5.1 of Auer et al.
(2003). We will use M different contexts for a suitable
number M. To define the regressor class F, we begin with
the policy class G consisting of all the $K^M$ mappings of
the form $g : X \to A$, where $X = \{1, 2, \ldots, M\}$ and
$A = \{1, 2, \ldots, K\}$. We require M to be the largest integer
such that $K^M \le N$, i.e., $M = \lfloor \ln N / \ln K \rfloor$. Each
mapping $g \in G$ defines a regressor $f_g \in F$ as follows:

$$f_g(x, a) = \begin{cases} 1/2 + \epsilon & \text{if } a = g(x) \\ 1/2 & \text{otherwise.} \end{cases}$$
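The regressor class used in this construction is small enough to enumerate for toy sizes, which can make the definition concrete. The sketch below is an illustration under two assumptions: $K \ge 2$, and the designated action receives expected reward $1/2 + \epsilon$, consistent with the per-round regret of $\epsilon$ used later in the proof. The helper names are hypothetical.

```python
import itertools


def largest_M(N, K):
    """Largest integer M with K**M <= N; integer arithmetic avoids rounding in ln N / ln K."""
    M = 0
    while K ** (M + 1) <= N:  # assumes K >= 2, otherwise this loop would not terminate
        M += 1
    return M


def build_regressor_class(N, K, eps):
    """Enumerate f_g for every mapping g from contexts {1..M} to actions {1..K}."""
    M = largest_M(N, K)
    actions = range(1, K + 1)
    regressors = []
    for g in itertools.product(actions, repeat=M):  # g[x - 1] plays the role of g(x)
        def f_g(x, a, g=g):
            return 0.5 + eps if a == g[x - 1] else 0.5
        regressors.append(f_g)
    return M, regressors
```

For example, N = 20 and K = 3 give M = 2 and a class of 9 regressors, one for each mapping g.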



The rewards are generated by picking a function $f \in F$
uniformly at random at the beginning. Equivalently, we
choose a mapping g that independently maps each context
$x \in X$ to a random action $a \in A$, and set $f = f_g$. In each
round t, a context $x_t$ is picked uniformly from X. For any
action a, a reward $r_t(a)$ is generated as a {0, 1} Bernoulli
trial with probability of 1 being equal to $f(x_t, a)$.

Now fix a context $x \in X$. We condition on all of the randomness
of the algorithm A, the choices of the contexts $x_t$
for t = 1, 2, ..., T, and the values of $g(x')$ for $x' \ne x$.
Thus the only randomness left is in the choice of $g(x)$ and
the realization of the rewards in each round. Let P' denote
the reward distribution where the rewards of any action a
for context x are chosen to be {0, 1} uniformly at random
(the rewards for other contexts $x' \ne x$ are still chosen according
to $f(x', a)$, however), and let E' denote the expectation
under P'.
Let $T_x$ be the rounds t where the context $x_t$ is x. Now fix
an action $a \in A$ and let $S_a$ be a random variable denoting
the number of rounds $t \in T_x$ when A chooses $a_t = a$.
Note that conditioned on $g(x) = a$, the random variable
$S_a$ counts the number of rounds in $T_x$ that A chooses the
optimal action a.



We use a corollary of Lemma A.1 in Auer et al. (2003):

Corollary 5.1 (Auer et al., 2003). Conditioned on the
choices of the contexts $x_t$ for t = 1, 2, ..., T, and the values
of $g(x')$ for $x' \ne x$, we have

$$E[S_a \mid g(x) = a] \le E'[S_a] + |T_x| \sqrt{2\epsilon^2 E'[S_a]}.$$

The proof uses the fact that when $g(x) = a$, rewards chosen
using P' are identical to those from the true distribution
except for the rounds when A chooses the action a.

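Thus, if $N_x$ is a random variable that counts the number of
rounds in $T_x$ that A chooses the optimal action for x
(without conditioning on $g(x)$), we have

$$E[N_x] = E_{g(x)}\left[E[S_{g(x)} \mid g(x)]\right]$$
$$\le E_{g(x)}\left[E'[S_{g(x)}] + |T_x| \sqrt{2\epsilon^2 E'[S_{g(x)}]}\right]$$
$$\le E_{g(x)}\left[E'[S_{g(x)}]\right] + |T_x| \sqrt{2\epsilon^2 E_{g(x)}\left[E'[S_{g(x)}]\right]},$$

by Jensen's inequality. Now note that

$$E_{g(x)}\left[E'[S_{g(x)}]\right] = E_{g(x)}\left[E'\left[\sum_{t \in T_x} 1\{a_t = g(x)\}\right]\right]$$
$$= E'\left[\sum_{t \in T_x} E_{g(x)}\left[1\{a_t = g(x)\}\right]\right]$$
$$= E'\left[\sum_{t \in T_x} \frac{1}{K}\right]$$
$$= \frac{|T_x|}{K}.$$

The third equality follows because $g(x)$ is independent of
the choices of the contexts $x_t$ for t = 1, 2, ..., T, and $g(x')$
for $x' \ne x$, and its distribution is uniform on A. Thus

$$E[N_x] \le \frac{|T_x|}{K} + |T_x| \sqrt{\frac{2\epsilon^2 |T_x|}{K}}.$$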



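Since in each round of $T_x$ in which A does not choose the optimal
action the algorithm A suffers an expected regret of $\epsilon$, the
expected regret of A over all the rounds in $T_x$ is at least
$\Omega\left(\epsilon |T_x| - \frac{\epsilon^2}{\sqrt{K}} |T_x|^{3/2}\right)$. Note that
this lower bound is independent of the choice of $g(x')$ for
$x' \ne x$. Thus, we can remove the conditioning on $g(x')$ for
$x' \ne x$ and conclude that only conditioned on the choices
of the contexts $x_t$ for t = 1, 2, ..., T, the expected regret
of the algorithm over all the rounds in $T_x$ is at least
$\Omega\left(\epsilon |T_x| - \frac{\epsilon^2}{\sqrt{K}} |T_x|^{3/2}\right)$. Summing up over all x, and
removing the conditioning on the choices of the contexts $x_t$
for t = 1, 2, ..., T by taking an expectation, we get the
following lower bound on the expected regret of A:

$$\Omega\left(\epsilon \sum_{x \in X} E[|T_x|] - \frac{\epsilon^2}{\sqrt{K}} \sum_{x \in X} E\left[|T_x|^{3/2}\right]\right).$$

Note that $|T_x|$ is distributed as Binomial(T, 1/M). Thus,
$E[|T_x|] = T/M$. Furthermore, by Jensen's inequality

$$E\left[|T_x|^{3/2}\right] \le \sqrt{E\left[|T_x|^3\right]} = \left(\frac{T}{M} + \frac{3T(T-1)}{M^2} + \frac{T(T-1)(T-2)}{M^3}\right)^{1/2} \le \frac{\sqrt{5}\, T^{3/2}}{M^{3/2}},$$

as long as $M \le T$. Plugging these bounds in, the lower
bound on the expected regret becomes

$$\Omega\left(\epsilon T - \frac{\epsilon^2}{\sqrt{KM}}\, T^{3/2}\right).$$

Choosing $\epsilon = \Theta(\sqrt{KM/T})$, we get that the expected regret
of A is lower bounded by

$$\Omega(\sqrt{KMT}) = \Omega(\sqrt{KT \ln N / \ln K}).$$

□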
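The third-moment expression for a Binomial(T, 1/M) variable used in this proof, and the bound $5T^3/M^3$ under $M \le T$, can be verified exactly for small parameters. The sketch below uses exact rational arithmetic; the function names are hypothetical and only serve the illustration.

```python
from fractions import Fraction
from math import comb


def third_moment_binomial(T, M):
    """Exact E[B^3] for B ~ Binomial(T, 1/M), by summing over the probability mass function."""
    p = Fraction(1, M)
    return sum(
        comb(T, k) * p ** k * (1 - p) ** (T - k) * k ** 3
        for k in range(T + 1)
    )


def closed_form(T, M):
    """T/M + 3T(T-1)/M^2 + T(T-1)(T-2)/M^3, the expression used in the proof."""
    p = Fraction(1, M)
    return T * p + 3 * T * (T - 1) * p ** 2 + T * (T - 1) * (T - 2) * p ** 3
```

When $M \le T$ each of the three terms is at most a constant multiple of $(T/M)^3$, with constants 1, 3 and 1, which is where the factor 5 comes from.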

6 Analysis of nontriviality 



Since the worst-case regret bound of our new algorithm is 
the same as for agnostic algorithms, a skeptic could con- 
clude that there is no power in the realizability assumption. 
Here, we show that in some cases, the realizability assumption
can be very powerful in reducing regret.

Theorem 6.1. For any algorithm A working with a set of
policies (rather than regressors), there exists a set of regressors
F and a distribution D satisfying the realizability
assumption such that the regret of A using the set $\Pi_F$ is
$\Omega(\sqrt{TK \ln N / \ln K})$, but the expected regret of Regressor
Elimination using F is at most $O(\ln(N/\delta))$.

Proof. Let F' be the set of functions and D the data distribution
that achieve the lower bound of Theorem 5.1 for
the algorithm A. Using Lemma 6.1 (see below), there exists
a set of functions F such that $\Pi_F = \Pi_{F'}$ and the expected
regret of Regressor Elimination using F is at most
$O(\ln(N/\delta))$. This set of functions F and distribution D
satisfy the requirements of the theorem. □

Lemma 6.1. For any distribution D and a set of policies $\Pi$
containing the optimal policy, there exists a set of functions
F satisfying the realizability assumption, such that
$\Pi = \Pi_F$ and the regret of Regressor Elimination using F is at
most $O(\ln(N/\delta))$.
Proof. The idea is to build a set of functions F such that
$\Pi = \Pi_F$, and for the optimal policy $\pi^*$ the corresponding
function $f^*$ exactly gives the expected rewards for each
context x and action a, but for any other policy $\pi$ the corresponding
function f gives a terrible estimate, allowing regressor
elimination to eliminate them quickly.

The construction is as follows. For $\pi^*$, we define the function
$f^*$ as $f^*(x, a) = E_{r|x}[r(a)]$. By optimality of $\pi^*$,
$\pi_{f^*} = \pi^*$. For every other policy $\pi$ we construct an f
such that $\pi = \pi_f$ but for which $f(x, a)$ is a very bad
estimate of $E_{r|x}[r(a)]$ for all actions a. Fix x and consider
two cases: the first is that $E_{r|x}[r(\pi(x))] > 0.75$
and the other is that $E_{r|x}[r(\pi(x))] \le 0.75$. In the first
case, we let $f(x, \pi(x)) = 0.51$. In the second case we let
$f(x, \pi(x)) = 1.0$. Now consider each other action a' in
turn. If $E_{r|x}[r(a')] > 0.25$ then we let $f(x, a') = 0$, and if
$E_{r|x}[r(a')] \le 0.25$ we let $f(x, a') = 0.5$.
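The case analysis above fits in one small function, which makes it easy to confirm numerically that the constructed value is always far from the true expected reward while the policy's action still attains the argmax. The function name and the grid used for checking are assumptions made for this sketch.

```python
def bad_regressor_value(expected_reward, is_policy_action):
    """f(x, a) from the construction, given E[r(a) | x] and whether a = pi(x)."""
    if is_policy_action:
        # The policy's own action gets a value above every other action's value.
        return 0.51 if expected_reward > 0.75 else 1.0
    return 0.0 if expected_reward > 0.25 else 0.5


def min_squared_gap(steps=1000):
    """Smallest (f(x, a) - E[r(a) | x])^2 over a grid of expected rewards in [0, 1]."""
    gaps = []
    for i in range(steps + 1):
        e = i / steps
        for flag in (True, False):
            gaps.append((bad_regressor_value(e, flag) - e) ** 2)
    return min(gaps)
```

The policy's action always receives a value of at least 0.51 and every other action at most 0.5, so the argmax of f agrees with $\pi$, and the squared gap stays above 1/20 on the grid.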

The regressor elimination algorithm eliminates regressors
with a too-large squared loss regret. Now fix any policy
$\pi \ne \pi^*$, and the corresponding f, and define, as in the proof of
Lemma 4.1, the random variable

$$Y_t = (f(x_t, a_t) - r_t(a_t))^2 - (f^*(x_t, a_t) - r_t(a_t))^2.$$

Note that

$$E_t[Y_t] = E_{x_t, a_t}\left[(f(x_t, a_t) - f^*(x_t, a_t))^2\right] \ge \frac{1}{20}, \qquad (6.1)$$

since for all (x, a), $(f(x, a) - f^*(x, a))^2 \ge \frac{1}{20}$ by construction.
This shows that the expected regret is significant.

Now suppose f is not eliminated and remains in $F_t$. Then
by equation (4.1) we get:

$$\frac{t}{20} \le \sum_{t'=1}^{t} E_{t'}[Y_{t'}] \le 100 \ln(1/\delta_t).$$

The above bound holds with probability $1 - \delta_t N t \log_2(t)$
uniformly for all $f \in F_t$. Using the choice of
$\delta_t = \delta / (2Nt^3 \log_2(t))$, we note that the bound fails to hold when
$t > 10^6 \ln(N/\delta)$. Thus, within $10^6 \ln(N/\delta)$ rounds all
suboptimal regressors are eliminated, and the algorithm
suffers no regret thereafter. Since the rewards are bounded
in [0, 1], the total regret in the first $10^6 \ln(N/\delta)$ rounds can
be at most $10^6 \ln(N/\delta)$, giving us the desired bound. □
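To get a feel for the constants, one can search numerically for the first round at which $t/20$ exceeds $100 \ln(1/\delta_t)$, so that no suboptimal regressor can still survive. The sketch below is illustrative only: the function names are hypothetical, and the search starts at t = 2 because $\log_2(1) = 0$ leaves $\delta_1$ undefined in this formula.

```python
import math


def delta_t(delta, N, t):
    """delta_t = delta / (2 N t^3 log2(t)); only defined for t >= 2."""
    return delta / (2 * N * t ** 3 * math.log2(t))


def first_contradiction_round(N, delta):
    """Smallest t >= 2 with t/20 > 100 ln(1/delta_t)."""
    t = 2
    while t / 20 <= 100 * math.log(1 / delta_t(delta, N, t)):
        t += 1
    return t
```

For moderate N and $\delta$ the round found this way lies well below $10^6 \ln(N/\delta)$, which is consistent with the generous constant in the proof.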

7 Removing the dependence on D 



While Algorithm 1 is conceptually simple and enjoys nice
theoretical guarantees, it has a serious drawback that it depends
on the distribution D from which the contexts $x_t$
are drawn in order to specify the constraint (3.1). A similar
issue was faced in the earlier work of Dudík et al. (2011),
where they replace the expectation under D with a sample
average over the contexts observed. We now discuss a similar
modification for Algorithm 1 and give a sketch of the
regret analysis.

The key change in Algorithm 1 is to replace the constraint
(3.1) with the sample version. Let
$H_t = \{x_1, x_2, \ldots, x_{t-1}\}$, and denote by $x \sim H_t$ the act of
selecting a context x from $H_t$ uniformly at random. Now we
pick a distribution $P_t$ on $F_{t-1}$ such that

$$\forall f \in F_{t-1} : \quad E_{x \sim H_t}\left[\frac{1}{P'_t(\pi_f(x)|x)}\right] \le E_{x \sim H_t}\left[|A(F_{t-1}, x)|\right]. \qquad (7.1)$$

Since Lemma A.1 applies to any distribution on the contexts,
in particular, the uniform distribution on $H_t$, this
constraint is still feasible. To justify this sample based
approximation, we appeal to Theorem 6 of Dudík et al.
(2011), which shows that for any $\epsilon \in (0, 1)$ and
$t \ge 16K \ln(8KN/\delta)$, with probability at least $1 - \delta$

$$E_x\left[\frac{1}{P'_t(\pi_f(x)|x)}\right] \le (1 + \epsilon)\, E_{x \sim H_t}\left[\frac{1}{P'_t(\pi_f(x)|x)}\right] + \frac{7500}{\epsilon^3} K.$$

Using Equation (7.1), since $|A(F_{t-1}, x)| \le K$, we get

$$E_x\left[\frac{1}{P'_t(\pi_f(x)|x)}\right] \le 7525K,$$

using $\epsilon = 0.999$. The remaining analysis of the algorithm
remains the same as before, except we now apply
Lemma 4.3 with a worse constant in the condition (4.3).



8 Conclusion 

The included results give us a basic understanding of the
realizability assumption setting: it can, but does not necessarily,
improve our ability to learn.

We did not address computational complexity in this pa- 
per. There are some reasons to be hopeful however. Due 
to the structure of the realizability assumption, an elimi- 
nated regressor continues to have an increasingly poor re- 
gret over time, implying that it may be possible to avoid the 
elimination step and simply restrict the set of regressors we 
care about when constructing a distribution. A basic ques- 
tion then is: can we make the formation of this distribution 
computationally tractable? 

Another question for future research is the extension to in- 
finite function classes. One would expect that this just in- 
volves replacing the log cardinality with something like a 
metric entropy or Rademacher complexity of F. This is not 
completely immediate since we are dealing with martin- 
gales, and direct application of covering arguments seems 
to yield a suboptimal $O(1/\sqrt{t})$ rate in Lemma 4.3. Extending
the variance based bound coming from Freedman's
inequality from a single martingale to a supremum over 
function classes would need a Talagrand-style concentra- 
tion inequality for martingales which is not available in the 
literature to the best of our knowledge. Understanding this 
issue better is an interesting topic for future work. 



A Feasibility

Lemma A.1. There exists a distribution $P_t$ on $F_{t-1}$ satisfying
the constraint (3.1).

Proof. Let $\Delta_{t-1}$ refer to the space of all distributions on
$F_{t-1}$. We observe that $\Delta_{t-1}$ is a convex, compact set. For
a distribution $Q \in \Delta_{t-1}$, define the conditional distribution
$Q(\cdot|x)$ on A as: sample $f \sim Q$, and return $\pi_f(x)$. Note
that $Q'(a|x) = (1 - \mu) Q(a|x) + \mu/K_x$, where
$K_x := |A(F_{t-1}, x)|$ for notational convenience.

The feasibility of constraint (3.1) can be written as

$$\min_{P_t \in \Delta_{t-1}} \max_{f \in F_{t-1}} E_x\left[\frac{1}{P'_t(\pi_f(x)|x)}\right] \le E_x\left[|A(F_{t-1}, x)|\right].$$

The LHS is equal to

$$\min_{P_t \in \Delta_{t-1}} \max_{Q \in \Delta_{t-1}} E_x\left[\sum_{f \in F_{t-1}} \frac{Q(f)}{P'_t(\pi_f(x)|x)}\right],$$

where we recall that $P'_t$ is the distribution induced on A by
$P_t$ as before. The function

$$E_x\left[\sum_{f \in F_{t-1}} \frac{Q(f)}{P'_t(\pi_f(x)|x)}\right]$$

is linear (and hence concave) in Q and convex in $P_t$. Applying
Sion's Minimax Theorem (stated below as Theorem A.1),
we see that the LHS is equal to

$$\max_{Q \in \Delta_{t-1}} \min_{P_t \in \Delta_{t-1}} E_x\left[\sum_{f \in F_{t-1}} \frac{Q(f)}{P'_t(\pi_f(x)|x)}\right]$$
$$\le \max_{Q \in \Delta_{t-1}} E_x\left[\sum_{f \in F_{t-1}} \frac{Q(f)}{Q'(\pi_f(x)|x)}\right]$$
$$= \max_{Q \in \Delta_{t-1}} E_x\left[\sum_{a \in A(F_{t-1}, x)} \; \sum_{f \in F_{t-1} : \pi_f(x) = a} \frac{Q(f)}{Q'(a|x)}\right]$$
$$= \max_{Q \in \Delta_{t-1}} E_x\left[\sum_{a \in A(F_{t-1}, x)} \frac{Q(a|x)}{Q'(a|x)}\right]$$
$$= \max_{Q \in \Delta_{t-1}} E_x\left[\frac{1}{1 - \mu} \sum_{a \in A(F_{t-1}, x)} \left(1 - \frac{\mu}{K_x Q'(a|x)}\right)\right]$$
$$\le \max_{Q \in \Delta_{t-1}} E_x[K_x].$$

The last inequality uses the fact that for any distribution
P on $\{1, 2, \ldots, K\}$, $\sum_{i=1}^{K} [1/P(i)]$ is minimized when all
$P(i)$ equal 1/K. Hence the constraint is always feasible.
□

Theorem A.1 (see Theorem 3.4 of Sion, 1958). Let U and
V be compact and convex sets, and $\phi : U \times V \to \mathbb{R}$ a
function which for all $v \in V$ is convex and continuous in u
and for all $u \in U$ is concave and continuous in v. Then

$$\min_{u \in U} \max_{v \in V} \phi(u, v) = \max_{v \in V} \min_{u \in U} \phi(u, v).$$
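The final step of the feasibility proof for Lemma A.1 rests on the pointwise bound $\sum_a Q(a|x)/Q'(a|x) \le K_x$, with $Q'(a|x) = (1 - \mu) Q(a|x) + \mu/K_x$. The sketch below checks this numerically on random distributions over the $K_x$ actions for a single fixed context; the helper names and the sampling scheme are assumptions made for this illustration.

```python
import random


def ratio_sum(q, mu):
    """Sum over a of Q(a|x) / Q'(a|x), where Q'(a|x) = (1 - mu) Q(a|x) + mu / K_x."""
    K_x = len(q)
    return sum(qa / ((1 - mu) * qa + mu / K_x) for qa in q)


def random_distribution(K_x, rng):
    """A random probability vector on K_x actions with strictly positive entries."""
    w = [rng.random() + 1e-9 for _ in range(K_x)]
    s = sum(w)
    return [wi / s for wi in w]
```

The uniform distribution attains the value $K_x$, and any other distribution gives a smaller sum, which is the same fact as the minimization of $\sum_i 1/P(i)$ used in the proof.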
B Freedman-style Inequality

Lemma B.1 (see Bartlett et al., 2008). Suppose
$X_1, X_2, \ldots, X_T$ is a martingale difference sequence
with $|X_t| \le b$ for all t. Let $V = \sum_{t=1}^{T} \mathrm{Var}_t[X_t]$ be the
sum of conditional variances. Then for any $\delta < 1/e^2$, with
probability at least $1 - \log_2(T)\delta$ we have

$$\sum_{t=1}^{T} X_t \le 4\sqrt{V \ln(1/\delta)} + 2b \ln(1/\delta).$$



